Enhancing Lemmatization for Mongolian and its Application to Statistical Machine Translation
Abstract
Lemmatization is crucial in natural language processing and information retrieval, especially for highly inflected languages such as Finnish and Mongolian. The state-of-the-art lemmatization method for Mongolian is scalable and does not need a noun dictionary, but its errors are mainly caused by problems related to part-of-speech (POS) information. To resolve this problem, we integrate POS tagging and lemmatization for Mongolian. We evaluate the effectiveness of our method and its contribution to statistical machine translation.
Similar resources
A Lemmatization Method for Modern Mongolian and its Application to Information Retrieval
In Modern Mongolian, a content word can be inflected when concatenated with suffixes. Identifying the original forms of content words is crucial for natural language processing and information retrieval. We propose a lemmatization method for Modern Mongolian and apply our method to indexing for information retrieval. We use technical abstracts to show the effectiveness of our method experimenta...
NRC Russian-English Machine Translation System for WMT 2016
We describe the statistical machine translation system developed at the National Research Council of Canada (NRC) for the Russian-English news translation task of the First Conference on Machine Translation (WMT 2016). Our submission is a phrase-based SMT system that tackles the morphological complexity of Russian through comprehensive use of lemmatization. The core of our lemmatization strateg...
Realignment from Finer-grained Alignment to Coarser-grained Alignment to Enhance Mongolian-Chinese SMT
The conventional Mongolian-Chinese statistical machine translation (SMT) model uses Mongolian words and Chinese words to train the system. However, data sparsity, complex Mongolian morphology, and Chinese word segmentation (CWS) errors lead to alignment errors and ambiguities. Some other works use finer-grained Mongolian stems and Chinese characters, which suffer from information loss when in...
Coping with Problems of Unicoded Traditional Mongolian
Traditional Mongolian Unicode Encoding has serious problems as several pairs of vowels with the same glyphs but different pronunciations are coded differently. We expose the severity of the problem by examples from our Mongolian corpus and propose two ways to alleviate the problem: first, developing a publicly available Mongolian input method that can help users to choose the correct encoding a...
Adapting Attention-Based Neural Network to Low-Resource Mongolian-Chinese Machine Translation
Neural machine translation (NMT) has shown very promising results for some resource-rich language pairs such as En-Fr and En-De. This success partly relies on the availability of large-scale, high-quality parallel corpora. We investigate how to adapt NMT to very low-resource Mongolian-Chinese machine translation by introducing an attention mechanism, sub-word translation, monolingual data, and an NMT corre...
Publication date: 2012